COMP6714: Information Retrieval & Web Search

Introduction to Information Retrieval
Lecture 6: Scoring, Term Weighting and the Vector Space Model

1


Recap of lecture 5
 Collection and vocabulary statistics: Heaps' and Zipf's laws
 Dictionary compression for Boolean indexes
 Dictionary string, blocks, front coding
 Postings compression: gap encoding, prefix-unique codes
 Variable-Byte, Gamma codes, Golomb/Rice codes

Component                              |       MB
collection (text, xml markup, etc.)    |  3,600.0
collection (text)                      |    960.0
term-document incidence matrix         | 40,000.0
postings, uncompressed (32-bit words)  |    400.0
postings, uncompressed (20 bits)       |    250.0
postings, variable byte encoded        |    116.0
postings, γ-encoded                    |    101.0
2


This lecture; IIR Sections 6.2–6.4.3
 Ranked retrieval
 Scoring documents
 Term frequency
 Collection statistics
 Weighting schemes
 Vector space scoring
3


Ch. 6
Ranked retrieval
 Thus far, our queries have all been Boolean.
 Documents either match or don't.
 Good for expert users with a precise understanding of their needs and the collection.
 Also good for applications: applications can easily consume 1000s of results.
 Not good for the majority of users.
 Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
 Most users don't want to wade through 1000s of results.
 This is particularly true of web search.
4


Ch. 6
Problem with Boolean search: feast or famine
 Boolean queries often result in either too few (=0) or too many (1000s) results.
 Query 1: "standard user dlink 650" → 200,000 hits
 Query 2: "standard user dlink 650 no card found": 0 hits
 It takes a lot of skill to come up with a query that produces a manageable number of hits.
 AND gives too few; OR gives too many
5


Ranked retrieval models
 Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
 Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language
 In principle, these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa
6


Ch. 6
Feast or famine: not a problem in ranked retrieval
 When a system produces a ranked result set, large result sets are not an issue
 Indeed, the size of the result set is not an issue
 We just show the top k (≈ 10) results
 We don't overwhelm the user
 Premise: the ranking algorithm works
7


Ch. 6
Scoring as the basis of ranked retrieval
 We wish to return, in order, the documents most likely to be useful to the searcher
 How can we rank-order the documents in the collection with respect to a query?
 Assign a score – say in [0, 1] – to each document
 This score measures how well document and query "match".
8


Ch. 6
Query-document matching scores
 We need a way of assigning a score to a query/document pair
 Let's start with a one-term query
 If the query term does not occur in the document: the score should be 0
 The more frequent the query term in the document, the higher the score (should be)
 We will look at a number of alternatives for this.
9


Ch. 6
Take 1: Jaccard coefficient
 Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B
 jaccard(A,B) = |A ∩ B| / |A ∪ B|
 jaccard(A,A) = 1
 jaccard(A,B) = 0 if A ∩ B = ∅
 A and B don't have to be the same size.
 Always assigns a number between 0 and 1.
10


Ch. 6
Jaccard coefficient: Scoring example
 What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
 Query: ides of march
 Document 1: caesar died in march
 Document 2: the long march

(The term Ides of March is best known as the date on which Julius Caesar was killed, in 709 AUC, i.e., 44 BC.)
11
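A quick check of the exercise above, as a minimal Python sketch (whitespace tokenization and lowercase input are assumptions made for illustration):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token sets."""
    a, b = set(a), set(b)
    if not a | b:
        return 0.0
    return len(a & b) / len(a | b)

query = "ides of march".split()
doc1 = "caesar died in march".split()
doc2 = "the long march".split()

print(jaccard(query, doc1))  # 1/6 ≈ 0.167 (only "march" is shared; union has 6 terms)
print(jaccard(query, doc2))  # 1/5 = 0.2  -> Document 2 scores higher
```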


Ch. 6
Issues with Jaccard for scoring
1. It doesn't consider term frequency (how many times a term occurs in a document)
2. Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information
 We need a more sophisticated way of normalizing for length
12


Sec. 6.2
Recall (Lecture 1): Binary term-document incidence matrix

[Incidence matrix figure omitted]
Each document is represented by a binary vector ∈ {0,1}^|V|
13


Sec. 6.2
Term-document count matrices
 Consider the number of occurrences of a term in a document:
 Each document is a count vector in ℕ^|V|: a column of the term-document count matrix

[Count matrix figure omitted]
14


Bag of words model
 The vector representation doesn't consider the ordering of words in a document
 John is quicker than Mary and Mary is quicker than John have the same vectors
 This is called the bag of words model.
 In a sense, this is a step back: the positional index was able to distinguish these two documents.
 We will look at "recovering" positional information later in this course.
 For now: bag of words model
15


Term frequency tf
 The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
 We want to use tf when computing query-document match scores. But how?
 Raw term frequency is not what we want:
 A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
 But not 10 times more relevant.
 Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR
16


Sec. 6.2
Log-frequency weighting
 The log frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})  if tf_{t,d} > 0;  0 otherwise

 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
 Score for a document-query pair: sum over terms t in both q and d:

    score(q,d) = Σ_{t ∈ q∩d} (1 + log10 tf_{t,d})

 The score is 0 if none of the query terms is present in the document.
17
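A minimal sketch of this overlap score in Python (whitespace tokenization is an assumption, not from the slides):

```python
import math
from collections import Counter

def log_tf_score(query, doc):
    """Sum of (1 + log10 tf) over query terms that appear in the doc."""
    tf = Counter(doc.split())
    return sum(1 + math.log10(tf[t]) for t in set(query.split()) if tf[t] > 0)

# tf: march=3, ides=1, of=1 -> (1 + log10 3) + 1 + 1 ≈ 3.477
print(round(log_tf_score("ides of march", "march march ides of march"), 3))
```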


Sec. 6.2.1
Document frequency
 Rare terms are more informative than frequent terms
 Recall stop words
 Consider a term in the query that is rare in the collection (e.g., arachnocentric)
 A document containing this term is very likely to be relevant to the query arachnocentric
 → We want a high weight for rare terms like arachnocentric.
18


Sec. 6.2.1
Document frequency, continued
 Frequent terms are less informative than rare terms
 Consider a query term that is frequent in the collection (e.g., high, increase, line)
 A document containing such a term is more likely to be relevant than a document that doesn't
 But it's not a sure indicator of relevance.
 → For frequent terms, we want high positive weights for words like high, increase, and line
 But lower weights than for rare terms.
 We will use document frequency (df) to capture this.
19


Sec. 6.2.1
idf weight
 df_t is the document frequency of t: the number of documents that contain t
 df_t is an inverse measure of the informativeness of t
 df_t ≤ N
 We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

 We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

It will turn out that the base of the log is immaterial.
20


Sec. 6.2.1
idf example, suppose N = 1 million

term       | df_t      | idf_t
calpurnia  | 1         | 6
animal     | 100       | 4
sunday     | 1,000     | 3
fly        | 10,000    | 2
under      | 100,000   | 1
the        | 1,000,000 | 0

There is one idf value for each term t in a collection.
21
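The idf column above is just log10(N/df_t); a few lines of Python reproduce it (df values hard-coded from the slide):

```python
import math

N = 1_000_000
df = {"calpurnia": 1, "animal": 100, "sunday": 1_000,
      "fly": 10_000, "under": 100_000, "the": 1_000_000}

for term, dft in df.items():
    # idf_t = log10(N / df_t): rarer terms get higher weights
    print(f"{term:10s} df={dft:>9,}  idf={math.log10(N / dft):.0f}")
```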


Effect of idf on ranking
 Does idf have an effect on ranking for one-term queries, like
 iPhone
 idf has no effect on ranking one-term queries
 idf affects the ranking of documents for queries with at least two terms
 For the query capricious person, idf weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
22


Sec. 6.2.1
Collection vs. Document frequency
 The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
 Example:

Word      | Collection frequency | Document frequency
insurance | 10440                | 3997
try       | 10422                | 8760

 Which word is a better search term (and should get a higher weight)?
23


Sec. 6.2.2
tf-idf weighting
 The tf-idf weight of a term is the product of its tf weight and its idf weight.

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

 Best known weighting scheme in information retrieval
 Note: the "-" in tf-idf is a hyphen, not a minus sign!
 Alternative names: tf.idf, tf × idf
 Increases with the number of occurrences within a document
 Increases with the rarity of the term in the collection
24


Sec. 6.2.2
Final ranking of documents for a query

    Score(q,d) = Σ_{t ∈ q∩d} tf.idf_{t,d}
25
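Tying the last few slides together: a small sketch that ranks documents by summed tf-idf, with log-tf and log-idf as defined above. The three-document corpus is invented for illustration:

```python
import math
from collections import Counter

docs = ["car insurance auto insurance", "best auto deals", "car repair"]
tfs = [Counter(d.split()) for d in docs]
N = len(docs)
df = Counter(t for tf in tfs for t in tf)  # number of docs containing each term

def tf_idf_score(query, tf):
    score = 0.0
    for t in set(query.split()):
        if tf[t] > 0:
            score += (1 + math.log10(tf[t])) * math.log10(N / df[t])
    return score

for d, tf in zip(docs, tfs):
    print(f"{tf_idf_score('best car insurance', tf):.3f}  {d}")
# The first document wins: "insurance" is rare (df=1) and occurs twice.
```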


Sec. 6.3
Binary → count → weight matrix

Each document is now represented by a real-valued vector of tf-idf weights ∈ ℝ^|V|
26


Sec. 6.3
Documents as vectors
 So we have a |V|-dimensional vector space
 Terms are axes of the space
 Documents are points or vectors in this space
 Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
 These are very sparse vectors – most entries are zero.
27


Sec. 6.3
Queries as vectors
 Key idea 1: do the same for queries: represent them as vectors in the space
 Key idea 2: rank documents according to their proximity to the query in this space
 proximity = similarity of vectors
 proximity ≈ inverse of distance
 Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
 Instead: rank more relevant documents higher than less relevant documents
28


Sec. 6.3
Formalizing vector space proximity
 First cut: distance between two points
 ( = distance between the end points of the two vectors)
 Euclidean distance?
 Euclidean distance is a bad idea . . .
 . . . because Euclidean distance is large for vectors of different lengths.
29


Sec. 6.3
Why distance is a bad idea

[Figure omitted] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.
30


Sec. 6.3
Use angle instead of distance
 Thought experiment: take a document d and append it to itself. Call this document d′.
 "Semantically" d and d′ have the same content
 The Euclidean distance between the two documents can be quite large
 The angle between the two documents is 0, corresponding to maximal similarity.
 Key idea: rank documents according to their angle with the query.
31


Sec. 6.3
From angles to cosines
 The following two notions are equivalent.
 Rank documents in increasing order of the angle between query and document
 Rank documents in decreasing order of cosine(query, document)
 Cosine is a monotonically decreasing function on the interval [0°, 180°]
32


Sec. 6.3
From angles to cosines
 But how – and why – should we be computing cosines?
33


Sec. 6.3
Length normalization
 A vector can be (length-)normalized by dividing each of its components by its length – for this we use the L2 norm:

    ‖x‖₂ = √(Σ_i x_i²)

 Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere)
 Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length-normalization.
 Long and short documents now have comparable weights
34


Sec. 6.3
cosine(query, document)

    cos(q,d) = (q · d) / (‖q‖ ‖d‖) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_i q_i²) · √(Σ_i d_i²) )

The numerator is the dot product of q and d; dividing by the lengths turns q and d into unit vectors.
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.
35


Cosine for length-normalized vectors
 For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q,d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.
36
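A minimal illustration of the last two slides in Python: normalize once, and cosine reduces to a plain dot product (dense lists stand in for the sparse vectors; the toy numbers are made up):

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v

def cosine(q, d):
    # For length-normalized vectors, cosine similarity is just the dot product
    return sum(qi * di for qi, di in zip(l2_normalize(q), l2_normalize(d)))

d = [3.0, 1.0, 0.0]
d_doubled = [6.0, 2.0, 0.0]   # "d appended to itself"
print(cosine(d, d_doubled))   # ≈ 1.0: identical after normalization
```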


Cosine similarity illustrated

[Figure omitted: query and document vectors in term space]


Sec. 6.3
Cosine similarity amongst 3 documents

How similar are the novels
SaS: Sense and Sensibility
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts)
term      | SaS | PaP | WH
affection | 115 | 58  | 20
jealous   | 10  | 7   | 11
gossip    | 2   | 0   | 6
wuthering | 0   | 0   | 38

Note: to simplify this example, we don't do idf weighting.
38


Sec. 6.3
3 documents example contd.

Log frequency weighting
term      | SaS  | PaP  | WH
affection | 3.06 | 2.76 | 2.30
jealous   | 2.00 | 1.85 | 2.04
gossip    | 1.30 | 0    | 1.78
wuthering | 0    | 0    | 2.58

After length normalization
term      | SaS   | PaP   | WH
affection | 0.789 | 0.832 | 0.524
jealous   | 0.515 | 0.555 | 0.465
gossip    | 0.335 | 0     | 0.405
wuthering | 0     | 0     | 0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69

Why do we have cos(SaS,PaP) > cos(SaS,WH)?
39
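These numbers can be reproduced in a few lines of Python (log-tf weighting, L2 normalization, dot products), mirroring the tables above:

```python
import math

counts = {  # term frequencies from the previous slide
    "SaS": [115, 10, 2, 0],   # affection, jealous, gossip, wuthering
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_tf(v):
    return [1 + math.log10(x) if x > 0 else 0.0 for x in v]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vecs = {name: normalize(log_tf(v)) for name, v in counts.items()}
for a, b in [("SaS", "PaP"), ("SaS", "WH"), ("PaP", "WH")]:
    sim = sum(x * y for x, y in zip(vecs[a], vecs[b]))
    print(f"cos({a},{b}) ≈ {sim:.2f}")   # 0.94, 0.79, 0.69
```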


Sec. 6.3
Computing cosine scores

[Algorithm figure omitted: CosineScore(q), IIR Figure 6.14]

40
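The slide's algorithm figure (term-at-a-time scoring with an array of accumulators) did not survive extraction. Below is a Python sketch in its spirit, assuming an in-memory inverted index mapping each term to (docID, tf) postings and precomputed document lengths; the data-structure names are assumptions, and lnc.ltc-style weights are used:

```python
import heapq
import math

def cosine_scores(query_terms, index, doc_lengths, N, k=10):
    """Term-at-a-time cosine scoring, in the spirit of IIR Figure 6.14."""
    scores = {}                                       # accumulator per docID
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        w_tq = math.log10(N / len(postings))          # query weight: idf (query tf assumed 1)
        for doc_id, tf in postings:
            w_td = 1 + math.log10(tf)                 # document weight: log tf, no idf
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]         # cosine normalization of the document
    # Query-side normalization is omitted: it rescales all scores equally
    # and so does not affect the ranking.
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```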


Sec. 6.4
tf-idf weighting has many variants

SMART notation (IIR Figure 6.15); the column entries are acronyms for weight schemes:

Term frequency:
n (natural): tf_{t,d}
l (logarithm): 1 + log(tf_{t,d})
a (augmented): 0.5 + 0.5 · tf_{t,d} / max_t tf_{t,d}
b (boolean): 1 if tf_{t,d} > 0, else 0
L (log ave): (1 + log tf_{t,d}) / (1 + log ave_{t∈d} tf_{t,d})

Document frequency:
n (no): 1
t (idf): log(N / df_t)
p (prob idf): max{0, log((N − df_t) / df_t)}

Normalization:
n (none): 1
c (cosine): 1 / √(w_1² + w_2² + … + w_M²)
u (pivoted unique): 1/u
b (byte size): 1/CharLength^α, α < 1

Why is the base of the log in idf immaterial?
41


Sec. 6.4
Weighting may differ in queries vs documents
 Many search engines allow different weightings for queries vs. documents
 SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table
 A very standard weighting scheme is: lnc.ltc
 Document: logarithmic tf (l as first character), no idf, and cosine normalization
 (No idf for documents: a bad idea?)
 Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …
42


Sec. 6.4
tf-idf example: lnc.ltc

Document: car insurance auto insurance
Query: best car insurance

Query (ltc)
Term      | tf-raw | tf-wt | df     | idf | wt  | n'lize
auto      | 0      | 0     | 5000   | 2.3 | 0   | 0
best      | 1      | 1     | 50000  | 1.3 | 1.3 | 0.34
car       | 1      | 1     | 10000  | 2.0 | 2.0 | 0.52
insurance | 1      | 1     | 1000   | 3.0 | 3.0 | 0.78

Document (lnc)                              | Prod
Term      | tf-raw | tf-wt | wt  | n'lize   |
auto      | 1      | 1     | 1   | 0.52     | 0
best      | 0      | 0     | 0   | 0        | 0
car       | 1      | 1     | 1   | 0.52     | 0.27
insurance | 2      | 1.3   | 1.3 | 0.68     | 0.53

Exercise: what is N, the number of docs?

Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
43
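A sketch reproducing the lnc.ltc computation above. The idf values are taken straight from the slide rather than derived from df (since N is left as an exercise); everything else follows the table:

```python
import math

query = ["best", "car", "insurance"]
doc = ["car", "insurance", "auto", "insurance"]
idf = {"auto": 2.3, "best": 1.3, "car": 2.0, "insurance": 3.0}  # from the slide

def log_tf(count):
    return 1 + math.log10(count) if count > 0 else 0.0

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

terms = sorted(set(query) | set(doc))
q_wt = [log_tf(query.count(t)) * idf[t] for t in terms]   # ltc: log tf × idf
d_wt = [log_tf(doc.count(t)) for t in terms]              # lnc: log tf, no idf

score = sum(q * d for q, d in zip(normalize(q_wt), normalize(d_wt)))
print(f"Score ≈ {score:.2f}")   # ≈ 0.80, matching the slide
```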


Summary – vector space ranking
 Represent the query as a weighted tf-idf vector
 Represent each document as a weighted tf-idf vector
 Compute the cosine similarity score for the query vector and each document vector
 Rank documents with respect to the query by score
 Return the top K (e.g., K = 10) to the user
44


Ch. 6
Resources for today's lecture
 IIR 6.2 – 6.4.3
 http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
 Term weighting and cosine similarity tutorial for SEO folk!
45